ABSTRACT

Our genome contains tens of thousands of long noncoding RNAs (lncRNAs), many of which are likely to have genetic regulatory
functions. It has been proposed that lncRNAs are organized into combinations of discrete functional domains, but the nature of
these domains and the means of identifying them remain elusive. One class of sequence elements that is enriched in lncRNA is
represented by transposable elements (TEs), repetitive mobile genetic sequences that have contributed widely to genome
evolution through a process termed exaptation. Here, we link these two concepts by proposing that exonic TEs act as RNA
domains that are essential for lncRNA function. We term such elements Repeat Insertion Domains of LncRNAs (RIDLs). A growing
number of RIDLs have been experimentally defined, where TE-derived fragments of lncRNA act as RNA-, DNA-, and protein-binding
domains. We propose that these reflect a more general phenomenon of exaptation during lncRNA evolution, where inserted
TE sequences are repurposed as recognition sites for both proteins and nucleic acids. We discuss a series of genomic screens
that may be used in the future to systematically discover RIDLs. The RIDL hypothesis has the potential to explain how
functional evolution can keep pace with the rapid gene evolution observed in lncRNA. More practically, TE maps may in the
future be used to predict lncRNA function.

Keywords: long noncoding RNA; lncRNA; transposable element; transposon; repeat element; genome; evolution; functional
domain



INTRODUCTION 

One of the great surprises from the past decade of genomics has been the discovery of many
thousands of long noncoding RNA (lncRNA) transcripts: The latest gene count in human has
reached 13,000 (GENCODE 18) (Derrien et al. 2012); and with improving gene annotations, as
well as rapidly increasing volumes of RNAseq data (Hangauer et al. 2013), it is likely that
it will soon exceed that of protein-coding genes. We do not yet know what proportion of
lncRNAs in these annotations are true genes (Graur et al. 2013) and which are simply
transcriptional noise (van Bakel et al. 2010). However, evolutionary evidence (Ponjavic
et al. 2007) and a growing roster of experimentally demonstrated cases (Amaral et al. 2011)
argue for a substantial core of bona fide genes that fulfill the strictest definitions of
function. Based on a growing body of literature, lncRNAs would appear to function primarily
as regulatory molecules both in the nucleus and cytoplasm through a wide repertoire of
mechanisms, including interaction with epigenetic protein complexes (Rinn et al. 2007) and
transcription factors (Kino et al. 2010), and hybridization to complementary RNA (Gong and
Maquat 2011) or DNA sequences (Simon et al. 2011). This has opened new avenues in the study
of human disease and biological processes (Faghihi et al. 2008; Gupta et al. 2010; Ng et al.
2012). Despite this progress, we still only have experimental information for about 130, or
1%, of annotated lncRNAs (Amaral et al. 2011). In part, this is due to our lack of
understanding of fundamental aspects of lncRNA biology, most notably the relationship
between sequence and function, and our consequent inability to predict lncRNA function based
on informatics analysis. To crack this sequence-function code, we must understand and
categorize the active domains of lncRNA, their mechanisms of action, and how they are
combined to yield a functional molecule.

In this article, we propose that one of the keys to understanding RNA function lies in the
transposable element (TE) sequences that lncRNAs abundantly contain. Specifically, we will
argue that TEs contribute preformed structural and sequence features that impart to lncRNA
the ability to interact with and regulate other molecules. By rapidly and continuously
shuffling such domains within new and existing lncRNAs, TEs have the potential to explain
the evolution of complex lncRNA regulatory networks.

THE CHALLENGE AND PROMISE OF MAPPING
lncRNA FUNCTIONAL DOMAINS

It was recently proposed that lncRNAs follow a modular organization, like proteins, composed
of discrete domains that in combination determine the lncRNA's function (Guttman and Rinn
2012). This is an attractive hypothesis with various conceptual and practical implications.
In evolutionary terms, domain organization explains how insertion or rearrangement of
functional subunits can alter the function of existing genes or create novel ones relatively
rapidly, through reuse of existing functional sequence rather than continual de novo
evolution. Given that domains usually originate as duplications from a reduced number of
canonical types related by structure and function (at least in proteins) (Koonin et al.
2002), we can identify them from primary sequence analysis, and classify them using sequence
or structural similarity. Moreover, we may use this information to predict the function of
novel genes by analysis of their primary sequence. Modular organization implies having
distinct functionalities encoded by discrete sequence regions, separated by flexible
linkers, and independent of context (Guttman and Rinn 2012). In lncRNA, functional domains
are likely to act in at least two distinct ways: (1) adoption of a specific secondary
structure that mediates the interaction with a protein partner; and (2) sequence-based
hybridization to another nucleic acid. In this review, we use the term domain rather
loosely, to include any clearly defined and self-contained region that confers upon its host
transcript some biological activity, including functional structures or sequence motifs that
interact with other molecules, but also regions that influence trafficking or processing,
such as miRNA binding sites.

At present, our understanding of lncRNA domains and domain organization is limited to a
small number of molecular biological and biochemical studies. These generally support the
modular view, showing that lncRNAs are organized into discrete units at structural and
functional levels, which retain their biological activity when separated from the rest of
the molecule. An excellent case in point is represented by XIST, a 17-kb 8-exon transcript
that is expressed from and represses one copy of the female X-chromosome in eutherians
(Brown et al. 1991). A series of 7.5 repeats, termed A-repeats, is necessary for chromosomal
silencing through recruitment of the PRC2 repressor complex (Zhao et al. 2008). Although the
solution structure of the A-repeats has been the topic of debate (Wutz et al. 2002; Zhao
et al. 2008; Maenner et al. 2010), the latest evidence suggests that the two halves of each
repeat play distinct roles: The 5' unit forms a highly stable hairpin structure, whereas the
3' portions form intermolecular hybrids with their counterparts from the other repeats
(Duszczyk et al. 2011). This silencing domain is distinct from localization activity, which
is encoded by dispersed elements elsewhere in the transcript and which is unaffected by 5'
deletions (Wutz et al. 2002). One advantage of working with XIST is the possibility of doing
functional studies using cell lines overexpressing variants of an XIST transgene, where the
impact of mutations on function is read out by measuring resultant changes in X-chromosome
silencing and cell survival (Wutz et al. 2002). Such studies show that sequence mutations
that do not alter the A-repeat structure have weak effects on function, whereas mutations
affecting structure result in abrogation of XIST-mediated silencing (Duszczyk et al. 2011).
This implies that, at least in the case of A-repeats, function depends in large part on RNA
adopting the correct structure, regardless of sequence. Finally, the A-repeat region's
function is independent of context, since a shorter XIST isoform, termed RepA, is also
capable of interacting with PRC2 in vivo (Zhao et al. 2008).

Other functionally validated lncRNAs also have a modular organization. HOTAIR has two
protein-binding domains, at the 5' and 3' ends, that bring together two distinct repressor
complexes, PRC2 and REST, respectively, at sites of gene repression (Tsai et al. 2010).
Another HOX locus transcript, HOTTIP, recruits the WDR5 chromatin remodeling protein through
a domain at its 5' end (Wang et al. 2011). The well-studied SRA coactivator transcript
represents a case of structural modularity: Here the whole transcript would appear to be
necessary for transcriptional activation, but the distinct structural subunits that
contribute to this activity are themselves modular (Novikova et al. 2012). Thus, lncRNAs
appear to be hubs where nucleic acids and proteins can be brought together, and it is
precisely their domain structure that underlies this.

At present, we have no method of systematically identifying lncRNA functional domains. The
development of such methods is hindered by a number of factors, most obviously the
aforementioned small number of validated cases to be used as training sets. The ability to
identify lncRNA domains would represent a major breakthrough because it would enable us to
predict a priori the functions of the many thousands of lncRNAs now known. In the case of
proteins, this is now straightforward: Clearly identifiable primary, secondary, and tertiary
structural features can be used to predict molecular activity and infer function, and such
prediction for novel protein sequences is routine (Baker and Sali 2001). Although many
methods exist for predicting RNA secondary structure with varying accuracy (Zuker 2003), we
cannot presently link these predictions to function.

Some progress has been made toward large-scale lncRNA functional prediction through a number
of approaches. Recently, Glazko et al. (2012) trained an SVM predictor for lncRNA
interactions with the Polycomb complex on human data, which seems to be effective in
predicting mouse interaction data. The predictor was trained on the Khalil et al. (2009)
PRC2 RIP-chip data and identified a combination of k-mers, TRANSFAC motifs, and sequence
complexity that was enriched in the PRC2-binding RNAs compared to nonbinders. The method
correctly identified known binders such as XIST and HOTAIR. However, it remains unclear how
these classifiers relate to the true underlying mechanism of lncRNA-PRC2 recognition; and
indeed, it remains formally possible that the classifier was identifying some other
confounding aspect of lncRNA behavior, such as expression level, rather than specific PRC2
recognition.
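
As a rough illustration of the k-mer component of such a classifier, the sketch below turns a transcript sequence into a normalized k-mer frequency vector of the kind that could be fed to an SVM. The function name `kmer_frequencies`, the default k, and the toy sequence are assumptions made here for illustration only; they are not taken from Glazko et al. (2012), whose feature set also included TRANSFAC motifs and sequence complexity.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Map every possible DNA k-mer to its relative frequency in seq."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA input alike
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Skip windows containing ambiguous bases such as N
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {km: (counts[km] / total if total else 0.0) for km in all_kmers}


# Toy sequence, invented for illustration
features = kmer_frequencies("GGAGAGAGGAUUC", k=2)
```

Every transcript yields a vector of the same length (4^k entries), so transcripts of different lengths can be compared directly; whether such features capture true binding determinants or a confounder is exactly the open question raised above.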

Computational methods have been published for predicting protein-lncRNA interactions but
these have not been extensively validated with high-throughput experimental data (Bellucci
et al. 2011; Muppirala et al. 2011; Wang et al. 2013b). Encouragingly, methods developed
recently, such as iCLIP and RIP-seq, are providing large-scale experimental maps of
protein-RNA interactions, which include lncRNA and may offer clues to function (Yang et al.
2011). Similar to Glazko et al.'s (2012) results, these protein-binding data sets tend to
identify short sequence motifs in binding sites. In light of their low specificity, it is
not clear whether these motifs alone specify binding, or whether larger but cryptic sequence
features also specify binding.

Although promising, the preceding methods do not yet yield large-scale information on lncRNA
functional domains. Results from Glazko et al. (2012), as well as various iCLIP data sets,
indicate that lncRNA molecular interactions are encoded in discrete sequence features that
can be identified informatically, and these features are modular in the sense that they have
similar functions in a wide number of lncRNA settings. One key feature of functional
sequences is that they should be stereotypical: They should have similar sequence features
in a large number of lncRNAs. We might take advantage of this observation to search for
candidate functional elements by searching for overrepresented sequence features in lncRNA.

TE SEQUENCES ARE ABUNDANTLY FOUND
IN lncRNA EXONS

An obvious group of repeated sequence features within lncRNA are transposable elements
(TEs). TEs are represented by various classes of repetitive, mobile sequence elements of
varying origin and evolutionary age that constitute between one-half and two-thirds of our
entire genomic sequence (Lander et al. 2001; de Koning et al. 2011). Previously regarded as
purely parasitic elements, it is now broadly acknowledged that TEs play fundamental roles in
cellular processes and in the evolution of genetic novelty (Cordaux and Batzer 2009). The
evolutionary process by which TE sequences are subverted for novel function by the host
genome is known as "exaptation" (de Souza et al. 2013). There is extensive literature
demonstrating that TEs have contributed repeatedly and profoundly to the evolution of genome
structure and function through the insertion of preformed sequence elements, both at the
level of genomic DNA, e.g., transcription factor binding sites (Johnson et al. 2006), splice
sites (Sela et al. 2010), enhancer elements (Huda et al. 2011b), and promoters (Huda et al.
2011a), and at the level of RNA, e.g., microRNA genes (Spengler et al. 2014), recognition
elements (Piriyapongsa and Jordan 2007), and protein-coding domains (Bowen and Jordan 2007).

Recently, a number of studies have highlighted an intriguing relationship between TE
sequences and long noncoding RNA. A large proportion of exonic lncRNA sequence has
originated from TEs: Based on a mixed lncRNA annotation from RNA sequencing and GENCODE,
Kelley and Rinn (2012) estimated that 41% of lncRNA nucleotides are derived from TEs, and
the majority of lncRNAs (83%) contain at least one TE fragment. As a consequence, many
mature lncRNA transcripts contain combinations of multiple repeat fragments reminiscent of
protein domain structures (Fig. 1).
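
A statistic such as the fraction of lncRNA nucleotides derived from TEs reduces to an interval-overlap calculation between exon and repeat annotations. The following is a minimal sketch under stated assumptions: coordinates are half-open (start, end) pairs on a single chromosome and strand, and the function names and example coordinates are invented for illustration rather than taken from Kelley and Rinn (2012).

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def te_fraction(exons, repeats):
    """Fraction of exonic nucleotides covered by at least one repeat."""
    exons = merge_intervals(exons)
    repeats = merge_intervals(repeats)  # merging avoids double counting
    exon_len = sum(end - start for start, end in exons)
    covered = 0
    for e_start, e_end in exons:
        for r_start, r_end in repeats:
            covered += max(0, min(e_end, r_end) - max(e_start, r_start))
    return covered / exon_len if exon_len else 0.0


# Hypothetical two-exon transcript and two repeat annotations
exons = [(100, 300), (500, 700)]
repeats = [(250, 550), (680, 900)]
fraction = te_fraction(exons, repeats)
```

Here 120 of the 400 exonic nucleotides overlap a repeat, so `fraction` is 0.3. A genome-scale analysis would use an interval tree or a sorted sweep instead of the nested loop, which is kept here only for readability.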

Particular families of TEs are strongly and nonrandomly enriched or depleted from lncRNA
sequence: Kelley and Rinn (2012) found a particularly strong overrepresentation of human
endogenous retrovirus (hERV) families in lncRNA exons compared to the genomic background,
but other classes such as LTR subtypes and MLT are also enriched. In contrast, families
including the highly numerous Alu, L1, and L2 classes are significantly depleted from
lncRNA. These patterns suggest that the presence of TE fragments within mature lncRNA
sequence might have been selected for or against during evolution.
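
Enrichment or depletion of a TE family relative to the genomic background can be expressed as a log2 ratio of densities. The sketch below is one simple way to compute it; the function name and all numbers are invented for illustration and do not reproduce the published analysis, which would additionally require a significance test.

```python
import math


def log2_enrichment(family_bp_in_exons, total_exon_bp, family_bp_in_genome, genome_bp):
    """log2 of (family density in lncRNA exons) / (family density genome-wide)."""
    exon_density = family_bp_in_exons / total_exon_bp
    genome_density = family_bp_in_genome / genome_bp
    return math.log2(exon_density / genome_density)


# Invented counts: positive values indicate enrichment, negative values depletion
enriched = log2_enrichment(2_000, 100_000, 10_000_000, 1_000_000_000)
depleted = log2_enrichment(500, 100_000, 20_000_000, 1_000_000_000)
```

With these made-up counts the first family is twofold enriched (log2 ratio of about 1) and the second is fourfold depleted (about -2).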

TEs have had a profound influence on lncRNA gene structure, particularly in terms of
regulatory regions and splice sites. In another recent paper, Cedric Feschotte and
colleagues found numerous examples in which lncRNA promoters, splice donor and acceptor
sites, and polyadenylation sites are composed of TE-derived sequence (Kapusta et al. 2013),
echoing a previous study demonstrating widespread alternative promoter contributions by TEs
(Faulkner et al. 2009). The TE content of lncRNA genes far exceeds that of protein-coding
genes, almost certainly due to the inability of protein-coding sequence to tolerate
insertions (Sela et al. 2010). Kelley and Rinn (2012) went further to show that the 127
lncRNAs promoted by HERVH elements are specifically up-regulated in pluripotent cell types,
which is consistent with previous observations of the overexpression of these elements in
human embryonic stem cells (Santoni et al. 2012). Indeed it is likely that TEs such as HERVH
are actually responsible for the birth of new lncRNAs by the insertion of active promoters
into previously inactive genomic regions (Kelley and Rinn 2012). It is worth noting that
HERVH is among the most enriched elements in lncRNA exonic sequence. Thus, TEs contain
preformed sequence motifs that have driven the evolution of lncRNA gene structures and
indeed the evolution of new lncRNAs.

The presence of TE sequence within a lncRNA does not appear to be detrimental, and clear
cases exist of repeat-rich, functional lncRNAs. The transcript linc-RoR, identified in
pluripotent embryonic stem cells, is capable of increasing ESC reprogramming efficiency when
included with the standard Yamanaka factors (Loewer et al. 2010). The mature linc-RoR
transcript is composed of ~70% TE-derived sequence from multiple families. Although the
location of this functionality within linc-RoR has not been mapped, the extent of repetitive
sequence in the transcript, as well as the observed link between TEs and pluripotency, is
highly suggestive of a role of endogenous retroviral sequence in promoting pluripotency
(Santoni et al. 2012).

Several other repeat-rich lncRNAs have been described. In mouse, a brain-specific transcript
AK046052, regulated by the master neural transcriptional repressor REST, is largely a mosaic
of TE-derived sequence (Johnson et al. 2009). Kelley and Rinn (2012) also highlighted a
number of functionally characterized lncRNAs, such as TUG1 (Young et al. 2005) and BANCR
(Flockhart et al. 2012), that contain significant amounts of TE-derived sequence. Perhaps
the most compelling example comes again from XIST, whose TE content has actually increased
in the human lineage since its evolutionary repurposing from a protein-coding gene
(Elisaphenko et al. 2008). Overall, we might conclude that TE sequence within lncRNA is the
rule rather than the exception, and high levels of TE insertion are compatible with lncRNA
activity.

What is less clear from these studies, however, is to what extent TEs have contributed to
functional sequence within the lncRNA transcript itself. Indeed, given cases like linc-RoR,
it is possible that, far from impairing function, TEs are necessary for the molecular
activity of lncRNAs. The enrichment (and indeed depletion) of particular TEs would appear to
argue that they have been selected for or against within lncRNAs, and thus, that their
presence has directly contributed to lncRNA function.

FIGURE 1. Examples of TE insertion profiles in annotated lncRNA. Insertions are represented
by arrows, colored by TE class. The rectangles represent mature lncRNA transcripts and are
to scale.

HYPOTHESIS: TRANSPOSABLE
ELEMENTS AS FUNCTIONAL
DOMAINS OF lncRNAs

The abundant and nonrandom insertion of TEs into lncRNA exons reviewed above leads us to
propose the following related hypotheses:

1. The set of TE insertions within lncRNA exons contains a subset of biologically active
sequences that are important for lncRNA function; and

2. TE insertion is a general evolutionary mechanism by which lncRNA functionality evolves
through the combinatorial addition of distinct TE domains that result in emergent and
complex properties in their host lncRNA.

Together these hypotheses can help to explain one of the outstanding questions regarding
lncRNAs: How can these genes, which are born over relatively short evolutionary timescales,
rapidly acquire molecular activity and play new functional roles? A newly expressed,
nonfunctional lncRNA may transcribe a preexisting TE fragment. Alternatively, a TE may be
inserted within an existing, functional lncRNA. In either case, if the TE sequence in
question has some kind of biological activity, it may confer that activity on the host
lncRNA and at a small but definite frequency confer a selective advantage.

How could TE-derived sequences contribute to lncRNA functionality? We next consider two
principal alternatives (Fig. 2). First, within the lncRNA, the TE sequence continues to
perform a function similar to that for which it evolved in the ancestral TE, most likely
through protein binding (Blackwell et al. 2012). Alternatively, the TE sequence might
mediate hybridization to other, homologous nucleic acid sequences (Gong and Maquat 2011). In
summary, we propose two principal classes of functional TE sequence within lncRNA (Fig. 2).

FIGURE 2. Functional classification of exonic TE insertions: (black) lncRNAs; (ovals)
proteins; (gray) interacting mRNAs/lncRNAs/genomic DNA; (arrows) gene promoters. Panels:
Type I, protein binding (endogenous repeat RNP; lncRNA with repeat fragment); Type IIa, RNA
binding (repeats in multiple mRNAs); Type IIb, DNA binding (repeats in multiple genomic
loci).

Type I: protein interaction 

In the course of their normal cellular lifecycle, TE transcripts interact with a variety of
proteins, both self-encoded and host-encoded, to form a ribonucleoprotein complex (RNP)
(Goodier et al. 2013). RNPs, such as Alu or LINE1, have been shown to interact with a
diverse range of host proteins, including chromatin modifiers, transcription factors, DNA
repair factors, RNA binding proteins, and RNA Polymerase II (Mariner et al. 2008; Blackwell
et al. 2012; Goodier et al. 2013). It is reasonable to infer that fresh insertion of TE
repeats within lncRNA may confer binding to the same complexes, thereby constituting
preformed protein-binding domains. Among those protein classes recently found to interact
with Alu and LINE1 are many, such as chromatin regulatory complexes, that are highly
relevant to known functional roles of lncRNA (Blackwell et al. 2012; Goodier et al. 2013).
Thus, there is a relationship, at least at early evolutionary stages, between the TE's
activity in the lncRNA context and its role in its original TE context.

Type II: nucleic acid interaction 

Repeat elements might also confer functionality through their sequence alone, by virtue of
its ability to hybridize specifically to the multiple other copies of the same repeat
element that exist, by definition, throughout the genome. In contrast to Type I, the
functionality of such sequence is not necessarily related to its functionality (if any)
within the endogenous TE. The specificity of this interaction will depend both on the length
of the TE fragments and on whether they originate from the same region of the TE consensus
sequence. Such hybridization may occur through Watson-Crick base-pairing by the
lncRNA-embedded repeat to either DNA or RNA sequences (Gong and Maquat 2011):

1. Type IIa: RNA binding

Inserted TEs could confer sequence-specific RNA-binding modules through simple
complementarity. The advantage of this is that such binding would occur in sequences derived
from the same or related TE families on the opposite strand of the target molecule, thus
enabling the evolution of a large repertoire of highly similar target sequences of extended
length, and hence specificity (Fig. 2). An example of this is targeting of mRNAs for
Staufen-mediated decay by lncRNA through Alu-mediated complementary base-pairing (Gong and
Maquat 2011).

2. Type IIb: DNA binding

It is likely that lncRNAs are capable of interacting directly with genomic DNA sequence
through conventional Watson-Crick base-pairing or through alternative modes such as
Hoogsteen base-pairing (Buske et al. 2012). As in the aforementioned case of RNA, the
abundance of near-identical TE elements within genomic DNA offers a plausible model whereby
complementary interactions mediated by embedded TE sequences with DNA could target lncRNAs
to specific genomic loci (Fig. 2). This model has been proposed for Alu fragments within the
ANRIL lncRNA (Holdt et al. 2013).
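
The complementarity underlying both Type II cases can be sketched computationally: a repeat embedded in a lncRNA can pair with any target that contains a stretch matching the reverse complement of that repeat. The code below is a minimal illustration that considers only perfect Watson-Crick pairs (no G-U wobble, mismatches, or Hoogsteen pairing); the function names and sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(seq):
    """Reverse complement in the RNA alphabet (T is read as U)."""
    return seq.upper().replace("T", "U").translate(COMPLEMENT)[::-1]


def longest_duplex(lncrna_fragment, target):
    """Length of the longest perfectly base-paired stretch between two strands."""
    probe = reverse_complement(lncrna_fragment)
    target = target.upper().replace("T", "U")
    best = 0
    prev = [0] * (len(target) + 1)
    # Longest common substring between the probe and the target
    for i in range(1, len(probe) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if probe[i - 1] == target[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical repeat fragment and a target carrying its antisense copy
fragment = "GGCCGGGCGC"
target = "AAA" + reverse_complement(fragment) + "UUU"
paired = longest_duplex(fragment, target)
```

In this toy case the whole 10-nt fragment can pair with the target. Because genomic DNA can be passed in directly (T is read as U), the same check applies to the RNA-DNA case, with the caveat that real duplex or triplex formation depends on much more than sequence identity.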

The precise functionality of lncRNA, like protein, resides in the combinatorics of its
constituent functional domains. In other words, different combinations of the TE-derived
domains mentioned above could give rise to lncRNAs with different regulatory abilities
(Fig. 3). For example, multiple distinct protein-binding sites would function to unite
proteins or protein complexes (as in HOTAIR) (Tsai et al. 2010). The combination of an
RNA-binding domain with a protein-binding domain could give rise to a regulator of mRNA
processing, represented by Uchl1as for example, whose antisense domain specifically targets
Uchl1 mRNA, while its SINEB2 repeat potentiates translation (Carrieri et al. 2012). Finally,
we propose a hypothetical RNA-DNA adaptor configuration that might serve to recruit other
ncRNAs (or even mRNAs) to specific genomic locations (Fig. 3).

FIGURE 3. (A) Activities of distinct RIDLs (RNA-protein, RNA-RNA, RNA-DNA). (B) Evolution of
diverse lncRNA functions through TE integration.

www.rnajournal.org 963

Johnson and Guigo

The combination of a DNA-binding sequence with a protein-binding domain might give rise to a
"transcription factor" lncRNA that recruits gene regulatory or epigenetic complexes to
defined genomic regions (for example, Fendrr) (Grote et al. 2013). Presumably, this is how
HOTAIR functions, since it is known to interact with chromatin regulatory complexes, such as
PRC2 and REST, and its binding sites as mapped by ChIRP (Chu et al. 2011) are enriched for a
GA-rich sequence motif that might be recognized by HOTAIR itself (although it has not been
definitively resolved whether HOTAIR directly binds to this motif, or how). Here we have
discussed the simplest two-domain combinations, but an essentially infinite variety of
possible combinations between nucleic acid- and protein-binding domains exists.

Within lncRNAs, these TE-derived domains would be expected to be interspersed with poorly
conserved linker regions, as proposed by Guttman and Rinn (2012). Furthermore, one might
expect that the extensive alternative splicing witnessed in lncRNA genes (Derrien et al.
2012) might give rise to transcripts with various combinations of protein- and nucleic
acid-binding domains.
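
The combinatorial idea can be made concrete with a toy data structure in which each exon carries at most one annotated domain and each isoform is an ordered list of exons. Everything in this sketch (exon names, domain labels, isoforms) is hypothetical and serves only to show how alternative splicing would generate distinct domain architectures from one gene.

```python
# Hypothetical exon-to-domain annotation for a single lncRNA gene
EXON_DOMAINS = {
    "exon1": "protein-binding",
    "exon2": None,  # linker region, no annotated domain
    "exon3": "RNA-binding",
    "exon4": "DNA-binding",
}


def domain_architecture(isoform_exons):
    """Ordered tuple of domains carried by an isoform, skipping linker exons."""
    return tuple(
        EXON_DOMAINS[exon] for exon in isoform_exons if EXON_DOMAINS[exon] is not None
    )


# Hypothetical isoforms produced by alternative splicing
isoforms = {
    "isoform_a": ["exon1", "exon2", "exon3"],
    "isoform_b": ["exon1", "exon2", "exon4"],
    "isoform_c": ["exon2", "exon4"],
}
architectures = {name: domain_architecture(exons) for name, exons in isoforms.items()}
```

In this toy gene, isoform_a pairs a protein-binding with an RNA-binding domain, isoform_b pairs it with a DNA-binding domain, and isoform_c retains the DNA-binding domain alone.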

In the following sections, we discuss first the experimental evidence supporting this
hypothesis, the implications for our understanding of lncRNA evolution, and finally, some
methods for the systematic discovery of TE-derived lncRNA functional domains.

TRANSPOSABLE ELEMENT RNA 
IS BIOLOGICALLY ACTIVE 

There is a growing body of experimental literature that supports the idea that TE fragments
within lncRNA contribute to function. This includes cases, discussed below, in which TEs
have clear, RNA-based biological activity either in isolation (this section) or within the
context of another RNA molecule (principally mRNAs), as well as direct evidence of
functional TEs within lncRNA (next section). In this section, we discuss the former case, in
which there is evidence for intrinsic biological activity of natural TE RNA sequence. These
cases have particular relevance where we propose that host lncRNAs acquire aspects of the
original activity of their TE repeats.

TE transcripts have been shown to have activity at the whole-cell level as well as in human
diseases and at the molecular level. In addition to being actively transcribed in cell
compartment-specific (Goodier et al. 2010), developmental (Rowe and Trono 2011), and
tissue-specific patterns (Faulkner et al. 2009), many TE insertions are under purifying
evolutionary selection (Lowe et al. 2007). There is evidence for biological activity of
repeats from a range of classes, from the large, autonomous long interspersed nuclear
elements (LINEs), through various virally derived long terminal repeat (LTR) sequences, to
the nonautonomous short interspersed nuclear elements (SINEs).

There is a range of evidence attesting to the activity of Alu 
sequence at both the DNA and RNA levels in both healthy 
and diseased tissues. This highly numerous, short (300 nt), 
structured element is derived from the 7SL signal recognition 
particle RNA and has expanded massively in the primate lin- 
eage (Lander et al. 2001; Giordano et al. 2007; Mariner et al. 
2008). It was recently shown that age-related macular degen- 
eration arises from aberrant Dicer processing in the retina, 
leading to the accumulation of Alu transcripts, which results 
in toxicity and consequently retinal neuronal degeneration 
(Kaneko et al. 2011). A recent screen for binding partners 
of Alu sequence discovered a diverse repertoire of protein 
partners, including a number of chromatin remodeling factors and transcription factors
(Blackwell et al. 2012). Indeed, ongoing work by Kugel and Goodrich has demonstrated that
Alu and other SINE transcripts are capable of binding and repressing RNA Polymerase II
activity through the adoption of a modular structure, thereby repressing global gene
transcription during heat shock (Mariner et al. 2008). These
data suggest that Alu transcripts may directly participate in 
genomic regulatory processes through protein interactions. 

Alu are not alone in their abundant expression and clear 
phenotypic effects on their host cells. A recent study also 
found that L1b retrotransposons are associated with the
chromatin modifying complexes that maintain neocentro- 
meres (Chueh et al. 2009). More evidence for binding to pro- 
tein complexes comes from a recent analysis of TDP43, the 
RNA binding protein involved in multiple neurodegenerative 
conditions (Li et al. 2012). Here, the authors showed that 
TDP43 is bound by a wide variety of TEs in both human 
and mouse neural cells, and this association is disrupted in 
disease, raising the possibility that differential protein bind- 
ing by TE transcripts may play a role in neurodegenerative 
processes. Finally, we recently showed that transposable ele- 
ments are globally derepressed in cancer, suggesting that their 
expression contributes to malignancy (Ferreira et al. 2014), 
possibly by inserting and altering transcription of proto-on- 
cogenes or tumor suppressors (Shukla et al. 2013). 

TE transcription appears to be a normal and regulated pro- 
cess during development. In mouse preimplantation blasto- 
cysts, LTR-type transposons are actively transcribed and 
contribute many cell-stage-specific promoters to other genes 
(Peaston et al. 2004), reminiscent of ESC-specific expression 
of HERVH-driven promoters (Kelley and Rinn 2012). In un- 
differentiated neural precursor cells of human and mouse, 
LINE1 elements are globally derepressed, resulting in cell-specific insertion events and
genetic mosaicism in adult neurons (Muotri et al. 2005; Baillie et al. 2011).

Although far from conclusive, these data together suggest 
that TE RNAs may play causative roles in fundamental cellu- 
lar processes. Furthermore TEs have many of the hallmarks 
of functional ncRNAs: modular structural organization, pro- 
tein interaction, specific trafficking within the cell, and evo- 
lutionary conservation. 

DIRECT EVIDENCE FOR TE-DERIVED FUNCTIONAL
DOMAINS IN lncRNAs

The hypothesis that TE sequences can act as functional domains of lncRNA has recently gained
support from a number of experimental studies, which provide examples for all but one of the
scenarios outlined in Figures 2 and 3. These cases are discussed below, and summarized in
Table 1.

We commence with the longest studied and most clearly
functionalized lncRNA, XIST, whose key role in mammalian
genetics is underwritten by at least three distinct,
repeat-derived functional domains. Silencing by XIST strictly
depends on the presence of the 5' repetitive A-repeat domain,
which is conserved across eutherians (Wutz et al. 2002;
Elisaphenko et al. 2008). Deletion of this region ablates the
repressive function of XIST, while leaving its targeting largely
unaffected (Wutz et al. 2002), although A-region mutants do
appear to have a deficiency in crossing active chromosomal
regions (Engreitz et al. 2013). The A-repeat region, as mentioned
above, adopts a structural configuration that interacts with
the repressive PRC2 complex to repress chromatin (Zhao
et al. 2008). The A-repeat region was recently shown to have
most likely originated from an endogenous retrovirus, ERVB5
(Elisaphenko et al. 2008). In contrast, the localization of XIST
seems to be dependent on sequences more dispersed throughout
the transcript (Wutz et al. 2002), although another study using
targeting antisense oligonucleotides implicated the
murine-specific C-repeat region in correct targeting through
unknown mechanisms (Beletskii et al. 2001). This targeting is
mediated by the specific interaction of repeat C with the
transcription factor, YY1, that directs XIST to specific genomic
loci through DNA binding (Jeon and Lee 2011). This region also
has a repetitive origin, having homology to another endogenous
retrovirus, ERVB4 (Elisaphenko et al. 2008). Most recently, it
was shown that the conserved Repeat F is part of the core region
necessary for Jarid2 interaction and may have originated from a
DNA transposon (Elisaphenko et al. 2008; da Rocha et al. 2014).
Thus, the distinct functionalities of XIST, targeting and
silencing, appear to have evolved from transposable elements,
which in combination give XIST at least three distinct
protein-binding modules as depicted in Figure 3B.

TABLE 1. Known cases of functional transposable element sequences within lncRNA

TE     | LncRNA          | Described activity | Reference
-------|-----------------|--------------------|----------
SINEB2 | Uchl1as         | Translational activator; two domains, SINEB2-encoded activator coupled to antisense recognition domain | Carrieri et al. (2012)
Alu    | Various         | Staufen-mediated decay, through antisense base-pairing to 3' UTR of coding genes | Gong and Maquat (2011)
Alu    | ANRIL           | Possible DNA recognition domain | Holdt et al. (2013)
L1PA8  | SLC7A2-IT1      | Putative structured domain whose mutation causes inherited childhood neurodegeneration | Cartault et al. (2012)
ERVB5  | XIST (A repeat) | Recruits PRC2 complex through formation of a loop structure; also interacts with splicing factor ASF/SF2; conserved across species | Wutz et al. (2002); Elisaphenko et al. (2008); Zhao et al. (2008)
ERVB4  | XIST (C repeat) | Interacts with YY1 protein (mouse and rat only) | Elisaphenko et al. (2008); Jeon and Lee (2011)
LINE1  | Fendrr          | Binds to low-complexity repeats in the promoters of at least two genes | Grote et al. (2013)

One intriguing observation is the long acknowledged
correlation between X-chromosome gene targeting by XIST
and the density of TEs around the promoters of those genes
(Wang et al. 2006). It is unclear whether the repeat content
of XIST is in any way related to the unexplained relationship
between the efficiency of silencing of genes on the
X-chromosome and the distribution of repeat elements in their
genomic neighborhood. Recent sequencing-based maps of
XIST along the inactive X have revealed a number of such
relationships, both positive and negative, at unparalleled
resolution (Engreitz et al. 2013). Strikingly, in both human
and mouse, the genes on the X-chromosome silenced by XIST
are significantly and positively correlated to the density
of MIR and L2 elements around their promoters (Wang et al.
2006; Engreitz et al. 2013). Inspection of the last exon
of XIST shows four sets of LINE2 and MIR repeats, with
conserved orientation, that presumably have resulted from
two rounds of sequence duplication (Fig. 4A). These repeats
in several cases overlap regions of elevated vertebrate
sequence constraint. Together these observations lead us to
speculate that these LINE2-MIR subunits contribute to the
targeting of XIST to the promoters of silenced target genes
on the X-chromosome through Watson-Crick base-pairing.
Future studies will be required to test this hypothesis.

TEs can also contribute DNA-binding domains to lncRNA
(Type IIb in Fig. 2): A recent study of the coronary artery
disease-associated lncRNA, ANRIL, showed that Alu
elements within its sequence were necessary for its biological
activity; and loss of embedded Alu elements reversed
ANRIL's promotion of growth, adhesion, and motility in 
cell models (Holdt et al. 2013). ANRIL binds to various epi- 
genetic regulatory proteins, including members of the PRC1 
and 2 complexes; intriguingly, Alu sequences complementary 
to those in ANRIL tend to have very specific spacing relative 
to PRC binding sites. Although the implications remain un- 
clear, we speculate that Alu motifs target the ANRIL-PRC2 
complex to complementary genomic sites. 

Another TE-derived DNA-binding domain has been identified
in the mouse lncRNA, Fendrr, which is necessary for
mouse heart and body wall development (Grote and Herrmann
2013). The authors showed evidence that Fendrr
directly binds to at least two gene promoters to which it
recruits various chromatin-remodeling factors. Inspection
of the putative DNA-binding domain of Fendrr shows that
it is derived from a LINE1 element (Fig. 4B). Thus, ANRIL
and Fendrr constitute two examples in which TE-derived
fragments mediate lncRNA genomic targeting
by Watson-Crick base-pairing, corresponding to the
"transcription factor" model shown in Figure 3B.

The idea of TEs contributing RNA-binding domains to
lncRNA (Type IIa) also has been experimentally validated.
Lynne Maquat's laboratory has shown in a series of papers
that mRNAs are targeted for degradation by the Staufen
protein through Alu repeats in their 3' UTR region (Gong and
Maquat 2011). The recognition by Staufen requires the
formation of a double-stranded RNA substrate, originally
identified through intramolecular base-pairing (Kim et al. 2007).
Subsequently, they showed that Staufen targets may also
form when mRNAs hybridize to lncRNA through
complementary Alu fragments (Gong and Maquat 2011). A given
lncRNA can target multiple different mRNAs through shared
Alu sequences, providing an attractive model for
post-transcriptional gene regulation by lncRNA, with specificity
provided by TEs.

Further evidence for the significance of TE-derived lncRNA
sequence comes from an intriguing recent study on a rare
neurodegenerative condition, infantile encephalopathy,
which is restricted to a small population from the island of
Reunion (Cartault et al. 2012). By genetic mapping, a
single-nucleotide disease-causing mutation was discovered in
an L1PA8 element embedded within a novel intergenic
lncRNA locus with brain-specific expression, SLC7A2-IT1.
siRNA-mediated knockdown of SLC7A2-IT1 induced
apoptosis in cultured neuroblastoma cells, suggesting that
its expression is necessary for neuronal survival. The
disease-causing mutation is predicted to fall within a structured
region formed by the repeat element. The single postmortem
brain sample the authors tested had strongly reduced levels of
the host RNA, but brain-expressed protein coding genes
located proximally to SLC7A2-IT1 in the genome were
unaffected, suggesting that (1) the neurodegenerative phenotype
is due to reduced levels of SLC7A2-IT1; (2) the L1 element
somehow controls lncRNA steady state levels; and (3) the
lncRNA functions in trans. Another interpretation is
that the L1 element serves to regulate transcription of the
lncRNA at the DNA level, and this hypothesis will have to
be ruled out before we can definitively state that SLC7A2-IT1
represents a TE-derived lncRNA domain.

Finally, TEs have recently been shown to play an integral
role in gene regulation by antisense lncRNAs. In a study on
regulation of the neuronal-specific Uchl1 mRNA by antisense
transcripts, the authors unveiled an elegant principle of
translational regulation: A bipartite antisense contains (1) a
"targeting" module, antisense to its target mRNA, with (2)
a downstream embedded SINEB2 repeat (Carrieri et al. 2012).
The antisense hybridizes to the mRNA, whereas the
SINEB2 repeat up-regulates its translation through a
mechanism that remains unclear. Removal of the SINEB2 element
completely abrogated the translational effect of the transcript.
The authors found other similar examples and indeed were
able to engineer synthetic lncRNAs to activate translation
of a GFP transgene. It is likely that other antisense lncRNAs
also bind their sense, coding transcript to effect other
regulatory outcomes: BACE1-as binds to and increases the stability
of BACE1 mRNA (Faghihi et al. 2008), whereas another
neural antisense transcript, BDNFOS, negatively regulates BDNF
mRNA (Lipovich et al. 2012). Both of these transcripts
contain multiple exonic TE insertions, although these have not
yet been strictly linked to function.

Together these cases provide a diverse body of evidence
that TE-derived fragments can and do contribute nucleic
acid and protein-binding modules that are strictly necessary
for the biological activity of lncRNAs.

TEs AND THE EVOLUTION OF COMPLEX lncRNA
REGULATORY NETWORKS

One key biological challenge is to understand the genomic
processes that underlie evolutionary changes, both in general
and specifically between Homo sapiens and other primates.
It has been proposed that lncRNA have played an essential
role in the evolution of developmental gene regulatory
networks underlying such changes (Britten and Davidson
1971; Pollard et al. 2006; Mattick 2009). Recent evidence
would indeed support a widespread role for lncRNA in the
regulation of key processes known to have undergone
substantial evolutionary change between mammals, including
stem cell pluripotency (Guttman et al. 2011),
neurodevelopment (Ng et al. 2012), and immune function
(Carpenter et al. 2013). Although recent studies have addressed
the evolution of lncRNA genes (Necsulea et al. 2014; Washietl
et al. 2014), the processes governing their functional evolution
have not been investigated. Transposable elements are likely to
have contributed to both processes.

LncRNA have several features distinct from proteins that 
would appear to give them an advantage as gene regulators 
in higher organisms:

They do not need to be translated into protein outside the
nucleus, so that they become functional immediately upon
transcription, and can regulate gene expression directly 
at their site of transcription (in cis), in addition to trans 
targeting. 

They are intrinsically versatile in their molecular interactions: 
They can interact with other molecules through both 
structural and sequence-specific modes, giving them po- 
tential to bridge proteins and nucleic acids. 

They are evolutionarily malleable, since their sequence can 
tolerate insertions or deletions, in contrast to protein-cod- 
ing open reading frame sequences that in most cases can- 
not tolerate such mutations without a loss of function. 

A regulatory role for lncRNA is supported by a wide range of
observations: A large number are associated with epigenetic
regulatory proteins (Khalil et al. 2009); they tend to be
localized in the nucleus and chromatin (Clark et al. 2012; Derrien
et al. 2012), although many are present also in the cytoplasm
(Gong and Maquat 2011; Carrieri et al. 2012); and indeed a
growing number of examples attest to their regulation of
gene expression in both cis and trans (Gupta et al. 2010;
Maamar et al. 2013). The exaptation of TE-derived modules
in lncRNA is consistent with such a regulatory role, since
such modules are likely to be capable of interacting with
highly relevant regulatory protein complexes (e.g., Alu and
chromatin regulatory factors), or of specifically recognizing
genes at either transcriptional (i.e., DNA recognition, such
as ANRIL) or post-transcriptional stages (i.e., RNA
recognition, such as the Alu-Staufen pathway).

To the preceding features we may also add a more general 
property of regulatory biomolecules, which is modularity. As 
discussed above, this composition of clearly defined subunits 
of distinct function is fundamental for two reasons: (1) It fa- 
cilitates evolutionary innovation through the simple rear- 
rangement or addition of domains within existing or new 
genes; and (2) modularity is required for the emergence of 
complexity in regulatory networks, since each domain repre- 
sents a molecular interaction in a genetic pathway, and thus 
combinations of domains represent connections between 
such pathways. Such organization is ubiquitous in regulatory 
proteins; for example, a typical regulatory transcription fac- 
tor will combine a DNA-binding domain, a protein-binding 
effector domain (often interacting with a chromatin modify- 
ing complex), and often some kind of sensor (for example, 
the ligand-binding domain, in the case of nuclear hormone 
receptors) (York and O'Malley 2010). The activity of the 
protein is determined by its domain structure, and this 
structure has been repeatedly shuffled through evolution to 
create new variation with altered functionality. One might 
imagine that by simply reshuffling combinations of geno- 
mic-targeting domains/RNA-targeting domains/activating 
or repressing domains, evolution could rapidly give rise to
novel lncRNAs that connect different components of cellular
networks.



In proteins, evolutionary tinkering in the form of domain 
shuffling takes place through insertion of novel coding se- 
quences by a variety of genomic recombination mechanisms 
(Buljan et al. 2010). This process is strictly limited by the re- 
quirement that the newly inserted exon be in the same open 
reading frame as the host gene, limiting the frequency with 
which such events give rise to a viable protein. In the case 
of an inserted internal exon, for example, just one in three in- 
sertions will result in a viable protein (Marsh and Teichmann 
2010; Schad et al. 2013). Similarly, although TEs have been 
shown to occasionally contribute novel exonic sequence to 
protein-coding genes, the insertion of a TE within a coding 
exon, or else the spliced inclusion of an entire TE-derived 
exon, only has a one-in-three probability of creating a viable 
protein, and even then it would likely be a stretch of nonsense 
protein (Sela et al. 2010). In contrast, lncRNA would be
expected to accept TE sequence much more readily without
adversely affecting their function, since the function of an
RNA sequence is not dependent on a strict frame or register.
Indeed, it has been proposed that lncRNAs consist of small
islands of functional sequence within large stretches of
functionally and evolutionarily neutral sequence (Guttman and
Rinn 2012). Therefore, lncRNA genes in general are more
likely to accept new sequence contributions while
maintaining functionality. This is reflected in the vastly higher
rate of exonization of TEs in lncRNA compared to protein coding
genes (Kapusta et al. 2013).
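
The one-in-three argument above can be illustrated with a short sketch. It assumes, purely for illustration, that inserted fragment lengths are uniformly distributed, and it ignores in-frame stop codons, so it describes an upper bound rather than a measured rate. The function names are ours and hypothetical.

```python
def preserves_frame(insert_length: int) -> bool:
    """An insertion keeps the downstream reading frame only if its
    length is a multiple of three."""
    return insert_length % 3 == 0


def frame_preserving_fraction(lengths) -> float:
    """Fraction of insert lengths that leave the reading frame intact."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no insert lengths given")
    return sum(preserves_frame(n) for n in lengths) / len(lengths)


# Illustrative assumption: every length from 1 to 300 nt is equally likely.
fraction = frame_preserving_fraction(range(1, 301))
print(f"{fraction:.3f}")
```

No equivalent length constraint applies to a noncoding exon, which is the asymmetry the argument relies on.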

Transposable elements are highly clade specific, a fine
example being the Alu element, which has expanded
massively in the primate lineage (Giordano et al. 2007). A
consequence of this is that TE activity might insert lineage-specific
functional domains into a conserved lncRNA transcript, as
suggested by Kapusta et al. (2013). This is an attractive
mechanism to explain lineage-specific changes in gene networks
controlled by lncRNA. This is particularly relevant given
the described functional roles played by lncRNA-embedded
Alus, including DNA binding (Holdt et al. 2013) and
mRNA recognition (Gong and Maquat 2011). Interestingly,
in the latter case, an analogous system evolved in the mouse
lineage (which lacks Alu), where Staufen-mediated decay is
instead mediated by recognition of other short repeat
elements, the Mus-specific B1, B2, and B4 (Wang et al.
2013a). Another similar case of analogous RNA function
again involves Alu in human and B2 in mouse, where both
are capable of binding and repressing RNA Pol II (Yakovchuk
et al. 2009). From these findings we might draw two
conclusions: (1) Analogous evolution of TE function might take
place in lncRNA from different evolutionary branches; and
(2) TE activity may contribute to lncRNA evolution and
divergence in particular lineages (similar to that observed
for TE-driven transcriptional network rewiring) (Bourque
et al. 2008).

An excellent example of lineage-specific TE insertion and
acquisition of function was recently described for ANRIL (He
et al. 2013). The evolutionary history of ANRIL in eutherians
has been complex, apparently gaining exons in primates and
most other lineages, but shrinking in rodents. In simians, a
particularly complex gene emerged, and this process was
accompanied by the fixing of multiple exonic TE insertions. In
primates, ANRIL exons have come under selection following
insertion of TEs. More intriguingly, those same exonic TE
fragments have also experienced selection following
insertion. Together, these data point to a situation in which a
preexisting lncRNA acquired new functional domains through
the TE insertions.

One key feature of TEs as targeting sequences is that, by
their nature, they are highly abundant in the genome (for
example, >1 × 10^6 Alus; >0.5 × 10^6 LINEs) (Cordaux and
Batzer 2009). Thus, any RIDL that operates through
base-pairing to complementary nucleotide sequence, be it DNA
or RNA, will have a multitude of potential binding sites
throughout the genome. Not only are these sites abundant,
but they are also highly specific, consisting of highly
complementary fragments often >100 nt long and potentially
participating in specific and energetically favorable binding.
This specificity would appear to be a key advantage of lncRNA
as a regulatory molecule compared to protein-based
transcription factors, whose genomic binding motifs are unrelated
to the encoded gene itself.

The processes by which lncRNAs are born are presently a
focus of research (Necsulea et al. 2014; Washietl et al. 2014).
Although outside the scope of this review, it is also worth
mentioning that, in addition to contributing functional
sequence to existing lncRNAs, TEs are also likely to be a driver
in the birth of new lncRNA genes. This occurs through the
insertion of novel TE promoter fragments in previously
inactive genomic regions, driving the transcription of lncRNA
transcripts that eventually acquire function. Kelley and
Rinn (2012) showed at least one excellent example of this,
in which HERV-derived promoters drive the expression of a
subset of lncRNAs specifically in pluripotent cells. It will be
fascinating to find out whether other classes of repeat drive
lncRNA expression in other tissue types, and whether this
mechanism is the principal driver of new lncRNA gene birth.
It is also worth mentioning that such transcripts will
necessarily carry some TE sequence at their 5' end, which could
conceivably contain functional elements.

We might consider two distinct functional roles of exapted
TEs that will result in distinct evolutionary patterns:
function through structure (Type I) or function through
sequence (Type II) (Fig. 2). Where function involves
the adoption of a structure for protein binding, we
might expect that the TE fragment will confer binding of
the host lncRNA to natural partners of the TE, specifically
the TE RNP (Fig. 2; Blackwell et al. 2012; Goodier et al.
2013). Such RNPs are known to interact with a wide range
of proteins, including those with regulatory functions of clear
relevance to lncRNA function (Blackwell et al. 2012). Thus,
TE protein partners represent obvious candidates to interact
with TE-containing lncRNAs. For the TE-derived fragments
of this type, we would expect them to undergo purifying
selection on RNA structure, with characteristic compensatory
mutations (Smith et al. 2013), exactly as has been observed
for the XIST A-repeats (Duszczyk et al. 2011).

On the other hand, exapted exonic TEs might function 
purely at the sequence level through hybridization to comple- 
mentary sequences in DNA or RNA. In this case, we would 
expect evolutionary constraint on RNA sequence but not 
necessarily on structure. More specifically, we would expect 
constraint at the complementary sites to which the RNA is 
binding, meaning that there should be correspondence in 
the precise subregion of the repeat consensus found in the 
RNA and in its genomic binding site. Widespread conserva- 
tion of intergenic TE fragments has already been observed 
(Lowe et al. 2007). Therefore, these differing constraints on 
exapted TE sequence may enable us to distinguish Type I 
and Type II domains (see below). 
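
The correspondence expected for Type II domains can be phrased as an overlap between two intervals on the repeat consensus: the subregion present in the lncRNA exon and the subregion present at a candidate genomic binding site. The sketch below is a minimal illustration under our own assumptions: fragments are given as 0-based, half-open consensus coordinates, and the function name and example coordinates are hypothetical. It does not check strand orientation, which would also need to be opposite for base-pairing.

```python
def consensus_overlap(rna_fragment, dna_fragment):
    """Return the fraction of the shorter fragment shared by both.

    Each fragment is a (start, end) interval in consensus coordinates,
    0-based and half-open.
    """
    (a_start, a_end), (b_start, b_end) = rna_fragment, dna_fragment
    shared = max(0, min(a_end, b_end) - max(a_start, b_start))
    shorter = min(a_end - a_start, b_end - b_start)
    if shorter <= 0:
        raise ValueError("fragments must have positive length")
    return shared / shorter


# Hypothetical example: an exonic fragment covering consensus positions
# 100-250 and a candidate genomic target covering positions 180-300.
print(consensus_overlap((100, 250), (180, 300)))
```

A value near zero would argue against a hybridization-based model for that pair, whereas a high value is merely compatible with one.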

Here, we have speculated on the possible role that TEs have
played in the evolution of regulatory lncRNAs. We conclude
that the RIDL hypothesis of lncRNA evolution through
acquisition of TE-derived functional domains is consistent
with the observed rapid evolution of regulatory lncRNA.
In the following section, we propose how we might go about
systematically identifying exapted TE domains using various
genomic analyses, including exploiting characteristic
evolutionary patterns that such TE fragments might undergo.

HOW TO FIND FUNCTIONAL TE DOMAINS 
GENOME-WIDE 

The hypothesis that TEs have extensively contributed to
lncRNA functional domains results in a number of testable
predictions about their sequence characteristics that
might be used to discover such exapted TE domains. In
this section, we lay out some such criteria and discuss their
application.

Identifying TE-derived lncRNA domains will be challenging
for a number of reasons, not least the vast number of
these sequences in the genome and the difficulty of using
evolutionary filters on lineage-specific TE insertions. First, it
is likely that many, if not the majority, of exapted TE sequences
will accumulate sequence changes such that we cannot
identify them as repeat-derived sequence. A good example
of this is the case of XIST, where the A-repeats are not
annotated as having a TE origin by RepeatMasker, but
nevertheless a more focused study using BLAST showed them
to derive from an endogenous retrovirus (Elisaphenko et al.
2008). Thus, genome-wide screens are likely to have poor
sensitivity for genuine exapted TEs.

It is important to note that the proportion of TEs extant in
the genome that have function is unknown. The majority of
exonic TE insertions may not contribute a beneficial change
to lncRNA activity, and their sequence will either evolve
under random drift (for neutral or weakly deleterious
insertions) or be eliminated from the population (strongly
deleterious insertions). Therefore, we must consider it likely
that genome-scale catalogs of TE-derived lncRNA will be
dominated by nonfunctional sequences under neutral evolution,
and the hallmarks of functionality indicated below will have
relatively weak signals. This effect will correlate with the
evolutionary age of the TE family: More recently transposed
repeats will be less likely to have acquired function and will
have a smaller fraction of functional members.

If exapted TEs come under purifying selection, then we 
may make another prediction that the signal from most of 
the genomic filters described in the following section should 
become more pronounced for each TE family as a function of 
the time since that family was active; in other words, we ex- 
pect to have the most power to identify signatures of exap- 
tation in older TEs, as the difference between neutral, 
nonfunctional instances compared to exapted instances be- 
comes more pronounced. Unfortunately, these same cases 
may be the hardest to identify as being repeat derived due 
to their age, as mentioned above. 

We must also be careful in interpreting evidence of
evolutionary selection: Such selection may be acting on a DNA
or RNA phenotype. Specifically, a TE sequence may be
conserved because it is acting through DNA, perhaps as a
transcription factor binding site (Johnson et al. 2006), and its
transcription within a lncRNA is purely coincidental. With
these caveats in mind, we here discuss criteria for the
genome-wide discovery of candidate RIDL elements.



TE subregion overrepresentation in lncRNA

TE families are defined by a consensus motif that contains
distinct subregions with distinct sequence, structural, or
functional properties (e.g., the UTRs and two ORFs of the
LINE1 element) (Gifford et al. 2013; Goodier et al. 2013).
Additionally, TEs tend not to insert their whole sequence
during a novel insertion but rather insert a subfragment of
their consensus motif, often of variable length originating at
the 3' end due to incomplete reverse transcription (Lowe
et al. 2007). We might expect that if particular subregions
become exapted following insertion, then they will be
overrepresented in the exons of host lncRNA, meaning that the
frequency of observing particular fragments of a TE within
lncRNA exons may differ from the genome as a whole. In
Figure 5, we show preliminary data from our group,
demonstrating the inclusion profile of the LINE1-like repeat, HAL1.
The base-level inclusion profile in lncRNA exons is distinct
from that of introns due to the presence of a peak of insertion
specific to elements found in exons corresponding to a
position around 1700 nt within the HAL1 consensus (indicated
by an arrow), lying in the ORF region.
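
The kind of inclusion profile described here can be sketched as follows. This is a minimal illustration, not the procedure used to generate Figure 5: it assumes each inserted fragment has already been mapped to an interval on the repeat consensus (0-based, half-open), and the function names and toy coordinates are hypothetical.

```python
def insertion_profile(fragments, consensus_length):
    """Per-position frequency with which consensus bases are included.

    `fragments` is an iterable of (start, end) intervals in consensus
    coordinates (0-based, half-open), one per inserted TE fragment.
    """
    fragments = list(fragments)
    if not fragments:
        raise ValueError("no fragments given")
    counts = [0] * consensus_length
    for start, end in fragments:
        for pos in range(max(0, start), min(end, consensus_length)):
            counts[pos] += 1
    return [c / len(fragments) for c in counts]


def profile_difference(exonic, intronic, consensus_length):
    """Exonic minus intronic inclusion frequency at each position."""
    exon_profile = insertion_profile(exonic, consensus_length)
    intron_profile = insertion_profile(intronic, consensus_length)
    return [e - i for e, i in zip(exon_profile, intron_profile)]


# Toy data on a 10-nt consensus (not real HAL1 coordinates).
exonic = [(2, 6), (3, 7), (3, 5)]
intronic = [(0, 4), (6, 10), (2, 8)]
diff = profile_difference(exonic, intronic, 10)
peak = max(range(10), key=lambda p: diff[p])
print(peak, round(diff[peak], 2))
```

Positions where the exonic profile exceeds the intronic one are candidates only; the difference by itself carries no estimate of statistical significance.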

Such analysis of insertion profiles may be a useful method
to filter functional lncRNA domains originating from TEs,
although care should be taken in interpreting nonrandom
profiles originating from processes such as exonization, which
cannot be assumed to be indicative of function a priori.
Once overrepresented TE subregions are found in lncRNA
exons, the function of those regions in their endogenous
TE transcript may hold clues to their role in the lncRNA.
We predict that the most pronounced insertion profiles will
reflect structures or protein-binding domains within TEs
(Type I RIDLs) due to their specificity and relatively localized
nature. In contrast, insertion profiles of Type II RIDLs will be
expected to correspond to the profiles of their genomic or
transcriptomic homology sites.

FIGURE 5. An approach to search for functional TE modules through
insertion profiles. Preliminary data are shown for the LINE1-related
HAL1 repeat. Top: a hypothetical lncRNA, containing TE fragment
insertions in exonic and intronic regions (red arrows). Bottom: The plot
shows base-level insertion frequency (y-axis), i.e., the probability of a
given nucleotide being found in inserted fragments, with respect to
position within the HAL1 consensus sequence (x-axis). Light blue and dark
blue lines denote intronic and exonic data, respectively. The number of
distinct insertion events upon which the data are based is shown above
the plot.

Base-level overrepresentation

Provided that a large proportion of a particular repeat family
has been exapted, its sequence may be overrepresented
as a fraction of lncRNA exonic sequence compared to
genomic sequence as a whole. This has been observed for multiple
TE families, whose sequences are strongly and statistically
significantly enriched in lncRNA exons, particularly various
classes of endogenous retroviruses (HERV, MLT, LTR)
(Kelley and Rinn 2012). Perhaps surprisingly, other classes of TE
were also found to be significantly underrepresented in
lncRNA exons, including various Alu subtypes; this effect
may equally result from TE functionality since potent TE
fragments may be selected against in many lncRNA hosts,
where their presence is somehow detrimental or
inappropriate to function, and only maintained in a subset, where they
confer a selective advantage. This is consistent with the
various documented activities of Alu sequence, both in
isolation and in lncRNA contexts (Yakovchuk et al. 2009; Gong
and Maquat 2011). Thus, counterintuitively, we may also
include underrepresentation as a potential signature of TE
exaptation.
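
The base-level overrepresentation described above can be expressed as a simple ratio of coverage fractions. The sketch below is our own illustration, not the method of the cited studies: it assumes that nucleotide counts per TE family have already been tallied from a repeat annotation, the function name is hypothetical, and the counts are made up.

```python
import math


def base_level_enrichment(te_exonic_nt, exonic_nt, te_genomic_nt, genomic_nt):
    """log2 ratio of TE-family coverage in lncRNA exons versus the genome.

    Positive values indicate overrepresentation in exons, negative values
    underrepresentation.
    """
    if min(te_exonic_nt, exonic_nt, te_genomic_nt, genomic_nt) <= 0:
        raise ValueError("all nucleotide counts must be positive")
    exonic_fraction = te_exonic_nt / exonic_nt
    genomic_fraction = te_genomic_nt / genomic_nt
    return math.log2(exonic_fraction / genomic_fraction)


# Made-up counts for two hypothetical families.
print(base_level_enrichment(4_000, 100_000, 10_000, 1_000_000))  # enriched
print(base_level_enrichment(500, 100_000, 20_000, 1_000_000))    # depleted
```

A signed log ratio treats enrichment and depletion symmetrically, which matches the point that underrepresentation may also be informative; assessing significance would require a separate statistical test.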

Strand bias 

If the function of a TE motif depends on the strandedness in
which it is transcribed, then exapted TEs should preferentially
be retained with a particular strand orientation relative
to the host lncRNA exon. We have identified numerous cases
of such strand bias for families of exonic TEs (R Johnson and
R Guigo, unpubl.). A crucial consideration in these cases is
that extreme strand bias will also be observed where TEs
are contributing to lncRNA gene structures (Kapusta et al.
2013): Splice sites, promoters, or entire exons contributed
by TEs will almost always occur through an element on
one specific strand of the TE consensus, and therefore the
resulting exonic TE regions will have a consistent strandedness
with respect to the host transcript.
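
One simple way to quantify such strand bias is an exact binomial test on the counts of sense and antisense exonic fragments for a TE family. The sketch below is a minimal illustration under an assumed 50:50 null expectation; this null is a simplification, and a background ratio (for example, one measured in introns) could be substituted. The function name and counts are hypothetical.

```python
from math import comb


def strand_bias_p_value(sense, antisense):
    """Two-sided exact binomial test against a 50:50 strand ratio."""
    n = sense + antisense
    if n == 0:
        raise ValueError("no insertions given")
    observed = comb(n, sense)
    # Sum the probabilities of all outcomes no more likely than the observed one.
    extreme = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return extreme / 2 ** n


# Hypothetical family: 18 sense and 2 antisense exonic fragments.
print(strand_bias_p_value(18, 2))
```

As noted above, a significant result does not distinguish exaptation from a structural contribution of the TE to the gene model, so fragments supplying splice sites or promoters would need to be handled separately.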

Evolutionary conservation 

Conservation is possibly the most powerful argument that
can be used for function. Functional TE fragments should
in principle display distinct evolutionary rates compared to
similar fragments outside lncRNA exons that are assumed
to be nonfunctional. Such a signature of selection was
reported by Kapusta et al. (2013). However, this analysis was flawed
since they specifically filtered intronic TE sequence to remove
potential functional sequence that overlapped active
chromatin marks without performing the same filtering on exonic
TE sequences to which they were compared. Indeed, manual
inspection reveals many instances of evolutionary
conservation of TEs within lncRNA that in fact overlap genomic
regulatory sites, i.e., the conservation is likely to arise as a result
of DNA function of the sequence rather than RNA function,
as has been observed previously (Lowe et al. 2007). This
means that equal filtering of both exonic and intronic TEs
must be carried out for such analyses to correctly understand
the source of sequence conservation (either DNA or RNA
function). Our unpublished global comparison of PhyloP
base-level conservation of exonic and intronic sequence
across all TE families does not reveal a significant signal of
selection (R Johnson and R Guigo, unpubl.).
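
The requirement for equal filtering can be made concrete with a short sketch in which the same exclusion rule is applied to exonic and intronic TE fragments before their conservation is compared. This is an illustration under our own assumptions, not the analysis referred to above: each fragment is a record with a precomputed mean conservation score and a flag for overlap with candidate regulatory DNA, and all names and values are hypothetical.

```python
from statistics import mean


def filtered_mean_score(elements, is_regulatory):
    """Mean conservation score after dropping elements flagged as
    overlapping candidate DNA regulatory sites."""
    kept = [e["score"] for e in elements if not is_regulatory(e)]
    if not kept:
        raise ValueError("no elements left after filtering")
    return mean(kept)


def exon_intron_difference(exonic, intronic, is_regulatory):
    """Exonic minus intronic mean score, with the same filter on both."""
    return (filtered_mean_score(exonic, is_regulatory)
            - filtered_mean_score(intronic, is_regulatory))


# Toy records; "score" stands in for a mean PhyloP value per TE fragment.
exonic = [{"score": 0.9, "regulatory": True},
          {"score": 0.2, "regulatory": False},
          {"score": 0.4, "regulatory": False}]
intronic = [{"score": 0.8, "regulatory": True},
            {"score": 0.1, "regulatory": False},
            {"score": 0.3, "regulatory": False}]
flag = lambda e: e["regulatory"]
print(round(exon_intron_difference(exonic, intronic, flag), 2))
```

Passing one predicate to both sets makes it impossible to filter only one side of the comparison, which is the asymmetry criticized above.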

However, this is not to say that individual repeat families
may not have evolutionarily conserved sequence in exons.
In support of this, there are many cases of apparent
conservation of candidate RIDL sequences. By analyzing
evolutionary conservation at each TE type in turn, we can find
numerous cases with very strong evidence for purifying
selection (Pollard et al. 2010; R Johnson and R Guigo, unpubl.).
One example is shown in Figure 6, in which exons of the
TUG1 transcript contain at least two evolutionarily
conserved regions originating from Charlie15a and MLT1K
transposons. Importantly, there is no evidence that these
repeats function at the DNA level, as revealed by the absence of
evidence of DNase I hypersensitivity or chromatin
modifications, consistent with the hypothesis that the evolutionary
selection is here acting on an RNA-based phenotype.

Finally, an important consideration in the analysis of evolutionary
conservation patterns in lncRNA will be exactly
what is being conserved: sequence or structure? Most analyses 
of genomic conservation use sequence conservation, which 
likely has poor sensitivity in detecting the conservation of 
RNA structures. In contrast, a number of methods to specifi- 
cally detect patterns of conservation in RNA structure have 
been presented, with increasing sensitivity (Washietl et al. 
2005; Pedersen et al. 2006; Smith et al. 2013). It may be possible
to take advantage of these differences in evolutionary
forces not only to find evidence for selection but also to predict
the function of a repeat. Specifically, we predict that exapted
TEs that work at the structural level (Type I) should display 
signals of conservation using methods adapted to RNA struc- 
ture evolution such as ECS (Smith et al. 2013), whereas Type II 
TEs that depend on hybridization will be detected by more 
standard filters for purifying sequence selection such as 
PhastCons or PhyloP (Siepel et al. 2005; Pollard et al. 2010). 

Secondary structures 

Exapted TE sequences may contain secondary structures 
that mediate their activity, and this may be reflected in a stat- 
istical overrepresentation of structured sequence. Many TEs 
are known to be highly structured, including Alu (Mariner 
et al. 2008). A simple metric, such as nucleotide-level pro- 
pensity for base-pairing, could be used to search for statistical 
enrichment. 
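
One possible form of such a metric is sketched below. It is an assumption made here for illustration, not a method taken from the cited work: the fraction of nucleotides paired in a maximal nested pairing (the classic Nussinov dynamic programme, with G-U wobble pairs allowed) is compared with the same quantity in shuffled versions of the sequence. The function names and the minimum loop length of 3 are hypothetical choices, and a thermodynamic folding tool would be preferable in a real analysis.

```python
import random

# Watson-Crick and G-U wobble pairs
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_pairs(seq, min_loop=3):
    """Nussinov dynamic programme: maximum number of nested base pairs."""
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case: position j is unpaired
            # case: j pairs with k, leaving at least min_loop bases between them
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


def pairing_propensity(seq):
    """Fraction of nucleotides that are paired in a maximal nested pairing."""
    return 2 * max_pairs(seq) / len(seq) if seq else 0.0


def shuffled_excess(seq, n_shuffles=100, seed=0):
    """Observed propensity minus the mean propensity of shuffled sequences."""
    rng = random.Random(seed)
    observed = pairing_propensity(seq)
    letters = list(seq)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        null.append(pairing_propensity("".join(letters)))
    return observed - sum(null) / len(null)
```

The shuffle used here preserves only mononucleotide composition; a dinucleotide-preserving shuffle would give a more conservative background. The dynamic programme scales cubically with sequence length, so it is suited to individual TE fragments rather than whole transcripts.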

Combinatorics 

Recurring combinations of TEs may be apparent in 
lncRNA at nonrandom frequencies. Such combinations are
observed in proteins, for example, in the frequent com- 
bination of KRAB-box repressor domains with zinc finger 
DNA-binding modules (Huntley et al. 2006). A possible ex- 
ample of this was mentioned previously in the context of 
XIST (Fig. 4A). 
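
A screen for nonrandom TE combinations could, for example, take the following form. This is a minimal sketch under assumptions introduced here: the input is a hypothetical mapping from transcript identifiers to the set of TE families found in their exons, and co-occurrence is tested against a hypergeometric null in which the two families are distributed independently across transcripts.

```python
from itertools import combinations
from math import comb


def hypergeom_upper_tail(k, n_total, n_a, n_b):
    """P(X >= k), where X is the number of transcripts carrying both families."""
    upper = min(n_a, n_b)
    numerator = sum(
        comb(n_a, x) * comb(n_total - n_a, n_b - x) for x in range(k, upper + 1)
    )
    return numerator / comb(n_total, n_b)


def cooccurrence_table(transcript_tes):
    """Rank TE family pairs by the significance of their co-occurrence.

    transcript_tes: dict mapping transcript id -> set of exonic TE families
    (a hypothetical input format).
    Returns tuples (family_a, family_b, observed, expected, p_value).
    """
    n_total = len(transcript_tes)
    family_counts = {}
    pair_counts = {}
    for families in transcript_tes.values():
        for fam in families:
            family_counts[fam] = family_counts.get(fam, 0) + 1
        for pair in combinations(sorted(families), 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
    results = []
    for (a, b), both in pair_counts.items():
        expected = family_counts[a] * family_counts[b] / n_total
        p = hypergeom_upper_tail(both, n_total, family_counts[a], family_counts[b])
        results.append((a, b, both, expected, p))
    return sorted(results, key=lambda r: r[4])
```

Because the number of family pairs is large, the resulting p-values would need correction for multiple testing, and the null model ignores confounders such as transcript length and local TE density.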

Cellular localization 

Some functional TE domains have been shown to associate 
with particular cellular compartments. For example, the
SINEB2 domain of the Uchl1-as transcript regulates localization
to the ribosome (Carrieri et al. 2012), and the Alu domain
of ANRIL is associated with chromatin (Holdt et al. 2013). Furthermore,
TE RNAs in isolation tend to localize at different sites within
the cell, e.g., SVA in the cytoplasm and Alu in the nucleus
(Goodier et al. 2010), and the signal driving this localization
presumably would act on lncRNA hosting those same TEs.
Similarly, the analysis of subcellular RNAseq data may reveal
enrichments of these and other TEs that could be acting as
localization signals for lncRNA or else point to binding to
other molecules with specific localization (Derrien et al.
2012; Djebali et al. 2012).

Protein interaction 

If TE sequences confer protein interaction domains on 
lncRNA, then we might expect to find signatures of this in experimental
data sets of protein-TE interactions. The most
obvious approach might be to search data sets such as recently
published whole-genome maps of protein-RNA interactions
represented by iCLIP or the related PAR-CLIP (Ule
et al. 2003). One would expect to find protein-RNA interaction
sites overlapping TE-derived fragments within lncRNA
at higher than expected frequencies. A complementary approach,
recently published by Lunyak's group, would be to
experimentally catalog the protein-interactome of a given 
TE RNA. Here, the authors used Alu RNA as bait to identify 
the full set of interacting partners, finding a large number of 
DNA repair and epigenetic proteins (Blackwell et al. 2012). 
We might expect that such interactions are also retained by 
Alu fragments that occur within lncRNA, raising the possibility
that Alu elements may form docking sites for chromatin
proteins on lncRNA.
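
The first approach, testing whether protein-RNA interaction sites fall within TE-derived fragments more often than expected, might be sketched as follows. The coordinate convention (half-open intervals in transcript coordinates), the function names, and the binomial null model are assumptions made here for illustration; the null treats interaction sites as uniformly distributed along the exonic sequence, which real data will violate because of differences in expression and in the mappability of repeats.

```python
from math import comb


def covered_length(intervals):
    """Total length of the union of half-open (start, end) intervals."""
    total, current_end = 0, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            total += end - start          # new, non-overlapping block
            current_end = end
        elif end > current_end:
            total += end - current_end    # extends the current block
            current_end = end
    return total


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))


def clip_te_enrichment(clip_positions, te_intervals, exon_length):
    """Compare observed and expected numbers of CLIP sites inside TE fragments.

    clip_positions: positions of interaction sites along the spliced transcript
    te_intervals:   half-open (start, end) TE-derived segments of that transcript
    exon_length:    total exonic length of the transcript
    Returns (observed, expected, p_value).
    """
    observed = sum(
        any(start <= pos < end for start, end in te_intervals)
        for pos in clip_positions
    )
    p_null = covered_length(te_intervals) / exon_length
    n_sites = len(clip_positions)
    expected = n_sites * p_null
    return observed, expected, binomial_upper_tail(observed, n_sites, p_null)
```

A more robust background could be obtained by comparing against CLIP sites in non-TE exonic sequence of matched expression, or by permuting TE positions within each transcript.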

OUTLOOK 

In this review, we have argued that transposable elements 
represent a fundamental and versatile source of novel functional
domains that facilitate the evolution of lncRNA. If
this is correct, then the identification and characterization
of these domains will represent a breakthrough in our ability to predict
and manipulate functional lncRNA. A small but compelling
set of examples attests to this, among them the functionally
validated lncRNAs XIST, ANRIL, RoR, and Uchl1-as. The
demonstration that two distinct functional modules of
XIST, the intensively studied and indispensable mammalian
X-chromosome inactivation lncRNA, are derived from TEs represents a powerful
clue that such a mechanism may be widespread in lncRNA
evolution. In addition to piecemeal identification of exapted
TEs, we present a framework for genome-level identification
of candidates. Hopefully, these data will eventually be integrated
into methods that can accurately infer the activity of
lncRNA based on their sequence alone.

NOTE ADDED IN PROOF

While this paper was in press, another case of Alu-mediated DNA 
targeting of a lncRNA was published by Anindya Dutta and colleagues
(Negishi et al. 2014).